Rework Linker dispatching for cross-major nvJitLink/driver skew #1911

Open
cpcloud wants to merge 8 commits into NVIDIA:main from cpcloud:linker-dispatch-rework-712

Conversation

@cpcloud
Contributor

cpcloud commented Apr 14, 2026

Summary

  • Replaces module-level "decide once" backend selection with per-Linker-instance dispatch at __init__ time
  • Factors decision into pure _choose_backend() helper for GPU-free unit testing
  • Handles nvJitLink/driver major-version mismatches: falls back to driver linker for non-LTO linking, raises RuntimeError for LTO when backends are incompatible
  • Probes driver_version() lazily — environments with nvJitLink but no driver (build containers) still work
  • _probe_nvjitlink() cached, warns at most once when nvJitLink is absent
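The "cached, warns at most once" behaviour in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's `_probe_nvjitlink()`: the function name `probe_optional_module` and the warning text are hypothetical, and the real helper presumably also inspects the nvJitLink version.

```python
import functools
import importlib
import warnings
from typing import Optional


@functools.lru_cache(maxsize=None)
def probe_optional_module(name: str) -> Optional[object]:
    """Import `name` once; return None (and warn once) if it is missing.

    Hypothetical helper: lru_cache memoizes the result per module name,
    so the import attempt and the warning happen at most once per name.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        warnings.warn(
            f"{name} is not available; non-LTO linking will use the driver",
            RuntimeWarning,
            stacklevel=2,
        )
        return None
```

Because the `None` result is cached too, repeated `Linker` construction in an environment without the optional module does not re-emit the warning.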

Breaking change: options.link_time_optimization=True with nvJitLink absent now raises RuntimeError instead of silently passing CU_JIT_LTO to the driver (which was not real LTO linking).

Decision matrix

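| driver | nvJitLink | ltoir input | lto/ptx | result |
|--------|-----------|-------------|---------|--------|
| any | None | no | no | driver |
| any | None | yes/lto | | raise |
| M | (M,*) | any | any | nvJitLink |
| D≠N | (N,*) | no | no | driver fallback |
| D≠N | (N,*) | yes/lto | | raise |
| None | available | any | any | nvJitLink |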
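Read as code, the matrix above might look like the sketch below. The function name, parameter names, return strings, and error messages are all assumptions made for illustration; the PR's `_choose_backend()` may use a different signature. The `lto/ptx` column is treated here as a single "LTO requested" flag.

```python
from typing import Optional, Tuple


def choose_backend_sketch(
    driver_major: Optional[int],
    nvjitlink_version: Optional[Tuple[int, int]],
    has_ltoir_input: bool,
    lto_requested: bool,
) -> str:
    """Pure decision function: no GPU, driver, or nvJitLink access needed."""
    needs_lto = has_ltoir_input or lto_requested

    # nvJitLink absent: only non-LTO linking can go through the driver.
    if nvjitlink_version is None:
        if needs_lto:
            raise RuntimeError("LTO linking requires nvJitLink")
        return "driver"

    # Driver version unknown (e.g. build container): pick nvJitLink optimistically.
    if driver_major is None:
        return "nvJitLink"

    # Matching majors: nvJitLink handles everything.
    if driver_major == nvjitlink_version[0]:
        return "nvJitLink"

    # Cross-major skew: the driver linker is the only safe choice, and it cannot do LTO.
    if needs_lto:
        raise RuntimeError("LTO linking is unavailable with mismatched major versions")
    return "driver"
```

For example, `choose_backend_sketch(13, (12, 9), False, False)` takes the cross-major, non-LTO branch and returns `"driver"`.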

Test plan

  • GPU-free parameterized tests for full decision matrix (test_linker_dispatch.py)
  • Test helpers handle driver-version failure gracefully
  • CI: existing GPU tests pass with per-instance dispatch
  • CI: cross-major behavior verified (requires multiple CTK versions)

Closes #712

🤖 Generated with Claude Code

cpcloud added this to the cuda.core v1.0.0 milestone Apr 14, 2026
cpcloud added the enhancement (Any code-related improvements), P0 (High priority - Must do!), cuda.core (Everything related to the cuda.core module), and breaking (Breaking changes are introduced) labels Apr 14, 2026
cpcloud self-assigned this Apr 14, 2026
cpcloud force-pushed the linker-dispatch-rework-712 branch 2 times, most recently from 0c94703 to 5d8fa24 April 16, 2026 21:58
cpcloud requested review from leofang and mdboom April 17, 2026 13:07
cpcloud force-pushed the linker-dispatch-rework-712 branch from 61ea4ff to a259b8d April 17, 2026 15:54
cpcloud and others added 8 commits April 18, 2026 05:44
Replace the module-level "decide once, use everywhere" nvJitLink-vs-driver
choice with a per-Linker-instance decision that considers the CUDA driver
major version, nvJitLink's availability and major version, the input code
types, and whether link-time optimization is requested.

The dispatch is factored into a pure helper `_choose_backend()` that is
fully unit-testable without a GPU. Its decision matrix:

- no nvJitLink, no LTO  -> driver
- matching majors       -> nvJitLink
- cross-major, no LTO   -> driver (nvJitLink output may not be loadable)
- LTO + no nvJitLink    -> RuntimeError
- LTO + cross-major     -> RuntimeError

This resolves the cross-major scenario described in NVIDIA#712, where
nvJitLink 12.x may produce a CUBIN that a 13.x driver cannot load (or vice
versa). The previous code committed to nvJitLink unconditionally whenever it
was importable.

Tests:

- `tests/test_linker_dispatch.py` parametrizes the entire matrix against
  `_choose_backend()` with mocked versions (no GPU, no driver required).
- `tests/test_linker.py::TestLinkerDispatch` drives the same decision
  through the real `Linker` constructor via monkeypatched version probes.
- `tests/test_optional_dependency_imports.py` is updated to exercise the
  new `_probe_nvjitlink()` helper in place of the removed
  `_decide_nvjitlink_or_driver()`.
- `tests/test_program.py` and `tests/test_linker.py` use a small local
  helper to compute the effective backend for the current environment.
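The "monkeypatched version probes" approach can be illustrated with the standard library alone. Everything here is a stand-in: `Probes`, `LinkerSketch`, and `backend_with` are hypothetical names, and the PR's tests use the real `Linker` constructor with pytest's monkeypatching rather than `unittest.mock`.

```python
from unittest import mock


class Probes:
    """Hypothetical stand-ins for the real driver/nvJitLink version probes."""

    @staticmethod
    def driver_major():
        return 13

    @staticmethod
    def nvjitlink_major():
        return None  # pretend nvJitLink is not installed by default


class LinkerSketch:
    """Toy class that chooses its backend per instance, at construction time."""

    def __init__(self, lto: bool = False):
        drv = Probes.driver_major()
        nvj = Probes.nvjitlink_major()
        if nvj is None:
            if lto:
                raise RuntimeError("LTO requires nvJitLink")
            self.backend = "driver"
        elif drv is None or drv == nvj:
            self.backend = "nvJitLink"
        elif lto:
            raise RuntimeError("LTO is unavailable across major versions")
        else:
            self.backend = "driver"


def backend_with(driver, nvjitlink, lto=False):
    """Construct a LinkerSketch with both probes patched to fixed values."""
    with mock.patch.object(Probes, "driver_major", return_value=driver), \
         mock.patch.object(Probes, "nvjitlink_major", return_value=nvjitlink):
        return LinkerSketch(lto=lto).backend
```

Because the decision is made per instance, patching the probes is enough to walk a constructor through every row of the matrix without a GPU; a module-level "decide once" choice would already be fixed by the time the patch applies.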

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

driver_version() was called unconditionally during Linker.__init__,
which fails in environments where nvJitLink is installed but the
CUDA driver is absent (e.g., build containers). Now catches the
exception and sets driver_major=None. When driver_major is unknown
and nvJitLink is available, optimistically selects the nvJitLink
backend.
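A minimal sketch of the lazy probe described in this commit, assuming the version query is a callable that returns a `(major, minor)` tuple and raises when no driver is present. The name `probe_driver_major` and that return shape are assumptions, not the actual `driver_version()` contract.

```python
from typing import Callable, Optional, Tuple


def probe_driver_major(get_version: Callable[[], Tuple[int, int]]) -> Optional[int]:
    """Return the driver major version, or None if the driver cannot be queried.

    `get_version` is a hypothetical stand-in for the real version query;
    any exception is treated as "driver version unknown".
    """
    try:
        major, _minor = get_version()
    except Exception:
        return None
    return major


def no_driver() -> Tuple[int, int]:
    """Simulates a build container where the driver library is missing."""
    raise RuntimeError("CUDA driver not found")
```

Here `probe_driver_major(no_driver)` yields `None`, which the dispatcher then treats as "unknown" rather than as a failure.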

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Test helpers calling driver_version() at module scope would crash
in no-driver environments before test collection. Mirror the
production lazy-probe pattern: catch exceptions and pass None.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Linker_link was nulling self._drv_log_bufs right after cuLinkComplete,
releasing the bytearrays whose addresses were handed to the driver via
CU_JIT_INFO_LOG_BUFFER and CU_JIT_ERROR_LOG_BUFFER at cuLinkCreate time.
The CUlinkState retains those pointers until cuLinkDestroy, which runs
during Linker tp_dealloc. Freeing the bytearrays first left the driver
with dangling pointers and corrupted the heap; subsequent CUDA calls
(e.g. NVRTC compilation in the next test fixture) segfaulted.

This path became reachable in CI with the new per-instance backend
dispatch: CTK 12.9.1 + driver 13.0 runners now hit the driver linker
for cross-major linking, which was never exercised before.

Retain _drv_log_bufs until the cdef class is deallocated; pxd
declaration order ensures _culink_handle is destroyed (and therefore
cuLinkDestroy runs) before the bytearrays are cleared.
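The ownership rule can be shown in plain Python with `ctypes`. This is only an analogy for the Cython fix: `FakeLinkState` is a hypothetical stand-in for a `CUlinkState` that keeps a raw address and writes to it on destroy, and `LinkerSketch` is not the real class.

```python
import ctypes


class FakeLinkState:
    """Stand-in for a native handle that holds a raw pointer to a log buffer."""

    def __init__(self, addr: int, size: int):
        self._addr = addr
        self._size = size

    def destroy(self) -> None:
        # Like the driver, write into the buffer as late as destroy time.
        msg = b"done"
        ctypes.memmove(self._addr, msg, min(len(msg), self._size))


class LinkerSketch:
    def __init__(self, log_size: int = 32):
        self._log_buf = bytearray(log_size)
        # The ctypes view shares memory with the bytearray; its address is
        # what gets handed to the "driver".
        self._view = (ctypes.c_char * log_size).from_buffer(self._log_buf)
        self._state = FakeLinkState(ctypes.addressof(self._view), log_size)

    def link(self) -> None:
        # Deliberately does NOT drop self._log_buf: the handle still points at it.
        pass

    def close(self) -> bytes:
        self._state.destroy()       # native side may still write here
        log = bytes(self._log_buf)  # copy out before releasing
        self._view = None           # release only after destroy
        self._log_buf = None
        return log
```

The buggy ordering corresponds to setting `self._log_buf = None` inside `link()`: the handle would then write through a pointer to memory that may already have been freed.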
…etime

The CUDA driver docs state: "optionValues must remain valid for the life
of the CUlinkState if output options are used." The driver writes log-
fill sizes (output) back into the optionValues slots for
CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES and CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES.
Linker_init previously declared c_jit_keys/c_jit_values as local
cdef vector[...] variables on its own stack; they were destroyed when
the function returned, leaving the driver to write through dangling
pointers during subsequent cuLinkAddData/cuLinkComplete/cuLinkDestroy calls.

This was always latent. It became reachable with the per-instance
backend dispatch (CTK 12.9.1 runners now select the driver linker when
they pair with a driver 13 install), and only manifested on driver 13
as heap corruption that killed the next NVRTC or link call.

Promote the two arrays to cdef class fields declared after
_culink_handle in the pxd. Cython's tp_dealloc destroys C++ fields in
pxd declaration order, so the vectors are destroyed after the shared_ptr
deleter runs cuLinkDestroy. The cuda.bindings high-level wrapper
(driver.cuLinkCreate) already handles this by attaching a keepalive to
CUlinkState; cuda.core's low-level cydriver.cuLinkCreate path did not.

Also drop the now-unused void_ptr ctypedef.

The as_bytes() method raises ValueError for unsupported backends (per
its docstring and matching the test directly above this one). The
driver-backend skip-guarded test was asserting RuntimeError, so it
always failed on CTK 12.9.1 runners where the skip condition does not
apply.

Adds parametrized cases for the build-container path where
cuDriverGetVersion is unqueryable: with nvJitLink present the
dispatcher picks nvjitlink optimistically; with nvJitLink absent
it falls back to driver for non-LTO and raises for LTO.

These paths are documented in _choose_backend's contract but were
previously uncovered.
cpcloud force-pushed the linker-dispatch-rework-712 branch from a259b8d to 53541a6 April 18, 2026 11:59
Development

Successfully merging this pull request may close these issues.

Re-work on Linker dispatching logic

1 participant